Decision Trees and Random Forests

  • We'll take a look at a powerful, non-parametric algorithm called random forests.
  • Random forests are an example of an ensemble method, meaning that it relies on aggregating the results of an ensemble of simpler estimators.
  • A majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting.
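
To get a feel for the last point, here is a minimal sketch (my own illustration, not part of the original notebook) that simulates 25 independent estimators, each correct 70% of the time, and compares a single vote with the majority vote. Real estimators are never fully independent, so the gain in practice is smaller, but the direction of the effect is the same.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_estimators, p_correct = 10_000, 25, 0.7

# each row is one sample; each column records whether one estimator got it right
correct = rng.random((n_samples, n_estimators)) < p_correct

single_accuracy = correct[:, 0].mean()
majority_accuracy = (correct.sum(axis=1) > n_estimators / 2).mean()
print(single_accuracy, majority_accuracy)   # roughly 0.70 vs 0.98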

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Motivating Random Forests: Decision Trees

  • Random forests are an example of an ensemble learner built on decision trees. For this reason we'll start by discussing decision trees themselves.
  • Decision trees are extremely intuitive: you ask a series of questions, each designed to narrow down the classification of a sample.
  • For example, if you wanted to build a decision tree to classify an animal, you might construct the tree shown here:

In [2]:
from IPython.display import Image
Image('images/decision-tree.png')


Out[2]:
  • The binary splitting makes this extremely efficient (given a reasonably balanced tree). Why?
  • But which questions should we ask?
  • In machine learning implementations of decision trees, the questions generally take the form of axis-aligned splits in the data: that is, each node in the tree splits the data into two groups using a cutoff value within one of the features.

Let's now look at an example of this.

Creating decision trees

Consider the following dataset:


In [3]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');


  • A simple decision tree built on this data will iteratively split the data along one axis or the other.
  • At each level, the algorithm assigns the label of each new region according to a majority vote of the points within it.
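
To make the splitting step concrete, here is a minimal from-scratch sketch (my own illustration; the real DecisionTreeClassifier implementation is more sophisticated) that scans candidate cutoffs along each axis and keeps the split with the lowest weighted Gini impurity:

In [ ]:
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_axis_aligned_split(X, y):
    best_feature, best_threshold, best_impurity = None, None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            # weighted average of the impurities of the two groups
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if impurity < best_impurity:
                best_feature, best_threshold, best_impurity = feature, threshold, impurity
    return best_feature, best_threshold, best_impurity

print(best_axis_aligned_split(X, y))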

This figure presents a visualization of the first four levels of a decision tree classifier for this data:


In [8]:
from IPython.display import Image
Image('images/decision_tree.png')


Out[8]:

Why is the uppermost branch not split?

In Scikit-Learn, a DecisionTreeClassifier estimator is used to construct decision trees.


In [9]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
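
As a quick sanity check (my addition, assuming a scikit-learn version that provides export_text; 'x0' and 'x1' are placeholder feature names), we can inspect how deep the unconstrained tree grew and print its top few rules:

In [ ]:
from sklearn.tree import export_text

# depth and leaf count of the fitted tree, plus a text dump of its first two levels
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
print(export_text(tree, feature_names=['x0', 'x1'], max_depth=2))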

In [10]:
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)

In [13]:
visualize_classifier(DecisionTreeClassifier(), X, y)



In [22]:
from helper.interactive_tree import plot_tree_interactive
plot_tree_interactive(X, y);


What do you think is the best classifier?

Decision trees and overfitting

  • A model overfits the data when it learns the particulars of the training set rather than the overall properties of the distribution it came from.
  • Decision trees tend to overfit; the quick train/test check below illustrates this.
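
Here is that check (my addition, not part of the original notebook): an unconstrained tree scores perfectly on the data it was trained on, while its held-out score is typically noticeably lower; limiting the depth usually narrows that gap.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=42)

deep = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)

# training accuracy vs. held-out accuracy for each tree
print("deep    train/test:", deep.score(Xtr, ytr), deep.score(Xte, yte))
print("shallow train/test:", shallow.score(Xtr, ytr), shallow.score(Xte, yte))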

In the following example we split the dataset and train a tree on each of the splits:


In [24]:
from IPython.display import Image
Image('images/decision-tree-overfitting.png')


Out[24]:
  • In some regions the trees produce consistent results
  • In other regions they do not agree
  • Inconsistencies tend to occur where points from different classes lie close together
  • Can we use both trees to enhance the prediction?
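
One simple way to combine the two trees (a sketch of my own, under the assumption that averaging class probabilities is a reasonable way to merge them, and that both halves contain every class so the probability columns line up) is to train a tree on each half of the data and average their predicted probabilities:

In [ ]:
from sklearn.tree import DecisionTreeClassifier

half = len(X) // 2
tree1 = DecisionTreeClassifier(random_state=0).fit(X[:half], y[:half])
tree2 = DecisionTreeClassifier(random_state=0).fit(X[half:], y[half:])

# average the per-class probabilities of the two trees, then take the most likely class
proba = (tree1.predict_proba(X) + tree2.predict_proba(X)) / 2
combined = proba.argmax(axis=1)
print("agreement with the true labels:", (combined == y).mean())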

In [25]:
from helper.interactive_tree import randomized_tree_interactive
randomized_tree_interactive(X, y)


Ensembles of Estimators: Random Forests

  • Idea: combining multiple overfitting estimators can reduce the effect of the overfitting. This approach is called bagging.
  • Bagging uses an ensemble of parallel estimators, each of which overfits the data, and averages their results to obtain a better classification.
  • An ensemble of randomized decision trees is known as a random forest.

This type of bagging classification can be done manually using Scikit-Learn's BaggingClassifier meta-estimator, as shown here:


In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8, random_state=1)

bag.fit(X, y)
visualize_classifier(bag, X, y)
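
As an optional follow-up (my addition): because bagging draws bootstrap samples by default, the same setup can report an out-of-bag score, an estimate of generalization accuracy computed from the samples each tree did not see during training.

In [ ]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bag_oob = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.8, oob_score=True, random_state=1)
bag_oob.fit(X, y)

# fraction of out-of-bag samples classified correctly
print(bag_oob.oob_score_)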


We've essentially built a random forest by hand. Scikit-Learn provides a RandomForestClassifier estimator that takes care of the randomization automatically and is easier to use:


In [31]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_classifier(model, X, y);


Final thoughts

  • Random forests are fast to train and fast at prediction time.
  • Both training and prediction parallelize naturally across trees.
  • Random forests can report a per-sample confidence for their predictions (see the sketch below).
  • Random forests need relatively little hyper-parameter tuning to give good results.
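
To illustrate the confidence point (a small sketch of my own): predict_proba returns the averaged per-tree class probabilities, which can be read as a confidence for each prediction.

In [ ]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# one row per sample, one column per class; each row sums to 1
print(model.predict_proba(X[:5]))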
